This notebook outlines how to simultaneously derive population and individual redshift distributions from a given collection of redshift PDFs using some of the tools available in frankenz.
In [1]:
from __future__ import print_function, division
import sys
import pickle
import numpy as np
import scipy
import scipy.stats  # scipy.stats.dirichlet is used for the prior below
import matplotlib
from matplotlib import pyplot as plt
from six.moves import range
# import frankenz code
import frankenz
# plot in-line within the notebook
%matplotlib inline
np.random.seed(265)
In [2]:
# re-defining plotting defaults
from matplotlib import rcParams
rcParams.update({'xtick.major.pad': '7.0'})
rcParams.update({'xtick.major.size': '7.5'})
rcParams.update({'xtick.major.width': '1.5'})
rcParams.update({'xtick.minor.pad': '7.0'})
rcParams.update({'xtick.minor.size': '3.5'})
rcParams.update({'xtick.minor.width': '1.0'})
rcParams.update({'ytick.major.pad': '7.0'})
rcParams.update({'ytick.major.size': '7.5'})
rcParams.update({'ytick.major.width': '1.5'})
rcParams.update({'ytick.minor.pad': '7.0'})
rcParams.update({'ytick.minor.size': '3.5'})
rcParams.update({'ytick.minor.width': '1.0'})
rcParams.update({'axes.titlepad': '15.0'})
rcParams.update({'font.size': 30})
For our proof-of-concept tests, we will use the same mock SDSS data as earlier.
In [3]:
downsample = 10 # downsampling the population
survey = pickle.load(open('../data/mock_sdss_cww_bpz.pkl', 'rb')) # load data
types = survey.data['types'][::downsample]
templates = survey.data['templates'][::downsample]
redshifts = survey.data['redshifts'][::downsample]
mags = survey.data['refmags'][::downsample]
Nobs = len(types)
The redshift distribution of our sample is plotted below.
In [4]:
# histogram
dzbin = 0.1
zbins = np.arange(-1, 7.+1e-5, dzbin) # redshift bins
zbins_mid = 0.5 * (zbins[1:] + zbins[:-1]) # bin midpoints
Nbins = len(zbins) - 1
# KDE
dzgrid = 0.01
zgrid = np.arange(-1., 7.+1e-5, dzgrid)
Ngrid, smooth = len(zgrid), 0.05
pdf = frankenz.pdf.gauss_kde(redshifts, np.ones(Nobs) * smooth, zgrid)
pdf /= np.trapz(pdf, zgrid)
# plotting
plt.figure(figsize=(14, 6))
plt.hist(redshifts, bins=zbins, histtype='stepfilled', lw=5,
color='blue', alpha=0.7, density=True, edgecolor='black')
plt.plot(zgrid, pdf, lw=5, color='red', alpha=0.8)
plt.hist(zgrid + 1e-5, bins=zbins, weights=pdf, histtype='step', lw=5,
color='red', alpha=0.5, density=True)
plt.xlabel('Redshift')
plt.xlim([zbins[0], zbins[-1]])
plt.yticks([])
plt.ylabel(r'$P(z|\mathbf{g})$')
plt.tight_layout()
For each object, we will generate a Gaussian redshift PDF with a variable width. As shown in the previous notebook, these satisfy all the statistical requirements of proper PDFs while remaining straightforward to work with.
In [5]:
# generate Gaussian PDFs over grid
sigma = np.random.uniform(0.05, 0.2, size=Nobs) # width
mu = np.random.normal(redshifts, sigma) # noisy observation
zpdf = np.array([frankenz.pdf.gaussian(mu[i], sigma[i], zgrid)
for i in range(Nobs)]) # redshift pdfs
zpdf /= np.trapz(zpdf, zgrid)[:,None] # normalizing
# generate PDFs over bins
zpdf_bins = np.array([frankenz.pdf.gaussian_bin(mu[i], sigma[i], zbins)
for i in range(Nobs)]) # redshift pdfs
zpdf_bins /= zpdf_bins.sum(axis=1)[:,None] * dzbin # normalizing
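As a quick sanity check, we can confirm that these are valid PDFs: non-negative everywhere, with the gridded versions integrating to one and the binned versions summing to one.
In [ ]:
# optional sanity check on the generated PDFs
print(np.all(zpdf >= 0.) and np.all(zpdf_bins >= 0.))   # non-negative
print(np.allclose(np.trapz(zpdf, zgrid), 1.))           # grid PDFs integrate to 1
print(np.allclose(zpdf_bins.sum(axis=1) * dzbin, 1.))   # binned PDFs sum to 1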
In [6]:
# plot some PDFs
plt.figure(figsize=(20, 12))
Nfigs = (3, 3)
Nplot = np.prod(Nfigs)
colors = plt.get_cmap('viridis')(np.linspace(0., 0.7, Nplot))
idxs = np.random.choice(Nobs, size=Nplot)
idxs = idxs[np.argsort(redshifts[idxs])]
for i, (j, c) in enumerate(zip(idxs, colors)):
plt.subplot(Nfigs[0], Nfigs[1], i + 1)
plt.plot(zgrid, zpdf[j], color=c, lw=4)
plt.hist(zbins_mid, zbins, weights=zpdf_bins[j],
color=c, lw=4, alpha=0.5, histtype='step')
plt.vlines(redshifts[j], 0., max(zpdf[j] * 1.2), color='red',
lw=3)
plt.xlim([-0.5, 6])
plt.ylim([0.03, None])
plt.xlabel('Redshift')
plt.yticks([])
plt.ylabel('PDF')
plt.tight_layout()
As discussed previously, for a sample of objects $g \in \mathbf{g}$, each with an associated redshift estimate $z_g$ with PDF $P(z_g | z)$, the posterior estimate for the population redshift distribution $P(z|\mathbf{g})$ can be written using Bayes' Theorem as
$$ P(\rho|\lbrace p_g \rbrace) \propto P(\lbrace p_g \rbrace | \rho) P(\rho) $$where $\rho \equiv P(z|\mathbf{g})$ and $p_g \equiv P(z|g)$. Assuming independence, the first term becomes
$$ P(\lbrace p_g \rbrace | \rho) = \prod_{g} P(p_g|\rho) = \prod_g \int P(z_g|z) P(z|\mathbf{g}) dz $$In other words, the posterior probability for the population redshift distribution $P(z|\mathbf{g})$ is based on how much it overlaps with each of the individual redshift PDFs $P(z_g | z)$ (modulated by some prior).
This result means that the population distribution is not what you get by stacking the individual redshift PDFs:
$$ P(z|\mathbf{g}) \neq \frac{1}{N_\mathbf{g}}\sum_g P(z_g | z) $$since there is no guarantee that $\frac{1}{N_\mathbf{g}}\sum_g P(z_g | z)$ maximizes the posterior probability. This is demonstrated below.
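To make this concrete, here is a small illustrative check (the helper loglike_pop and the update step below are written purely for demonstration): starting from the stacked estimate, a single multiplicative (EM-style) reweighting of the bins increases $\sum_g \ln\left(\mathbf{p}_g \cdot \boldsymbol{\rho}\right)$, so the stack cannot already be the maximum-likelihood solution.
In [ ]:
# illustrative check: one multiplicative reweighting step improves on the stack
pdfs_norm = zpdf_bins / zpdf_bins.sum(axis=1)[:, None]  # normalized binned PDFs

def loglike_pop(rho):
    """Log-likelihood sum_g ln(p_g . rho) of a binned distribution rho."""
    rho = rho / rho.sum()
    return np.sum(np.log(np.dot(pdfs_norm, rho)))

rho_stack = pdfs_norm.mean(axis=0)  # stacked PDFs (already sums to 1)
weights = pdfs_norm / np.dot(pdfs_norm, rho_stack)[:, None]  # per-object reweighting
rho_onestep = rho_stack * weights.mean(axis=0)  # one EM-style update (still sums to 1)
print(loglike_pop(rho_stack), loglike_pop(rho_onestep))  # second value should be larger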
In [7]:
# plotting
plt.figure(figsize=(14, 6))
plt.plot(zgrid, pdf, lw=5, color='black', alpha=0.8,
label='Truth (KDE)')
plt.plot(zgrid, zpdf.sum(axis=0) / Nobs, lw=5, color='blue',
alpha=0.6, label='Stacked PDFs')
plt.hist(redshifts, bins=zbins, histtype='step', lw=5,
color='black', alpha=0.7, density=True)
plt.hist(zbins_mid, bins=zbins, weights=zpdf_bins.sum(axis=0) / Nobs,
histtype='step', lw=5,
color='blue', alpha=0.5, density=True)
plt.xlabel('Redshift')
plt.xlim([zgrid[0], zgrid[-1]])
plt.yticks([])
plt.ylabel(r'$N(z|\mathbf{g})$')
plt.ylim([0., None])
plt.legend(fontsize=28, loc='best')
plt.tight_layout()
Intuitively, this is just telling us that we are dealing with a deconvolution problem -- we need to "take out" the error on the individual PDFs to get the population distribution right.
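We can see the convolution directly with a minimal sketch (relying on the Gaussian noise model assumed above): each observed mean is scattered by sigma and then smoothed by sigma again, so the stacked PDFs should closely track a KDE of the true redshifts with bandwidths inflated by $\sqrt{2}$.
In [ ]:
# sketch: the stacked PDFs resemble the true N(z) convolved with the noise
pdf_conv = frankenz.pdf.gauss_kde(redshifts, np.sqrt(2.) * sigma, zgrid)
pdf_conv /= np.trapz(pdf_conv, zgrid)  # normalize
pdf_stack = zpdf.sum(axis=0) / Nobs  # stacked PDFs
plt.figure(figsize=(14, 6))
plt.plot(zgrid, pdf_stack, lw=5, color='blue', alpha=0.6, label='Stacked PDFs')
plt.plot(zgrid, pdf_conv, lw=5, color='gray', alpha=0.8, linestyle='--',
         label='Truth convolved\nwith noise')
plt.xlabel('Redshift')
plt.xlim([zgrid[0], zgrid[-1]])
plt.yticks([])
plt.ylabel(r'$P(z|\mathbf{g})$')
plt.legend(fontsize=28, loc='best')
plt.tight_layout()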
As discussed earlier, attempts to deal with this problem after stacking don't work. Three prototypical examples are shown below.
In [8]:
# number of samples
Nsamples = 500
In [9]:
# draw Poisson samples
pdf1 = zpdf_bins.sum(axis=0) # stack PDFs
pdf1 /= pdf1.sum() # normalize
pdf1 *= Nobs # transform to counts
pdf1_samples = np.array([np.random.poisson(pdf1)
for i in range(Nsamples)]) # draw samples
In [10]:
# draw multinomial samples
pdf2_samples = np.random.multinomial(Nobs, pdf1 / pdf1.sum(),
size=Nsamples) # samples
In [11]:
# draw posterior samples
pdf3_samples = np.zeros_like(pdf1_samples)
zpdf_norm = zpdf_bins / zpdf_bins.sum(axis=1)[:, None]
for j in range(Nsamples):
if j % 50 == 0:
sys.stderr.write(' {0}'.format(j))
for i in range(Nobs):
# stack categorical draws
pdf3_samples[j] += np.random.multinomial(1, zpdf_norm[i])
In [12]:
def zplot_bin(samples, label='type', color='blue', downsample=5):
"""Plot our binned draws."""
[plt.hist(zbins_mid + 1e-5, zbins,
weights=samples[i], lw=3,
histtype='step', color=color, alpha=0.05)
for i in np.arange(Nsamples)[::downsample]]
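    # near-invisible histogram (tiny weights) drawn only to create a labeled legend entry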
plt.hist(zgrid, weights=pdf*1e-5, lw=3, histtype='step',
color=color, alpha=0.6, label=label)
h = plt.hist(redshifts, zbins,
histtype='step', lw=6, color='black', alpha=0.7)
plt.xlabel('Redshift')
plt.xlim([-0.5, 4])
plt.yticks([])
plt.ylim([0, max(h[0]) * 1.2])
plt.ylabel('$N(z|\mathbf{g})$')
plt.legend(fontsize=26, loc='best')
plt.tight_layout()
def cov_draws(samples, bin1=(14, 16), bin2=(16, 18), color='blue',
label='label', xlim=None, ylim=None):
"""Plot our draws within two bins."""
# Bin results.
n, _ = np.histogram(redshifts, bins=zbins)
pdf_bin1 = n[bin1[0]:bin1[1]].sum() / n.sum() * Nobs / 1e3
pdf_bin2 = n[bin2[0]:bin2[1]].sum() / n.sum() * Nobs / 1e3
samples_bin1 = samples[:, bin1[0]:bin1[1]].sum(axis=1) / 1e3
samples_bin2 = samples[:, bin2[0]:bin2[1]].sum(axis=1) / 1e3
# Plot results.
plt.vlines(pdf_bin1, 0, 100, lw=2, colors='black', linestyles='--')
plt.hlines(pdf_bin2, 0, 100, lw=2, colors='black', linestyles='--')
plt.plot(pdf_bin1, pdf_bin2, 's', color='black', markersize=20)
plt.plot(samples_bin1, samples_bin2, 'o', color=color,
label=label, markersize=8, alpha=0.4)
if xlim is None:
plt.xlim([min(pdf_bin1, min(samples_bin1)) - 0.1,
max(pdf_bin1, max(samples_bin1)) + 0.1])
else:
plt.xlim(xlim)
if ylim is None:
plt.ylim([min(pdf_bin2, min(samples_bin2)) - 0.1,
max(pdf_bin2, max(samples_bin2)) + 0.1])
else:
plt.ylim(ylim)
plt.xlabel(r'$N({:6.1f}\leq z < {:6.1f}) \quad [10^3]$'.format(zbins[bin1[0]],
zbins[bin1[1]]))
plt.ylabel(r'$N({:6.1f}\leq z < {:6.1f}) \quad [10^3]$'.format(zbins[bin2[0]],
zbins[bin2[1]]))
plt.legend(fontsize=28, loc=2)
plt.tight_layout()
In [13]:
# plotting
plt.figure(figsize=(30, 6))
plt.subplot(1, 3, 1)
zplot_bin(pdf1_samples, label='Poisson', color='blue')
plt.subplot(1, 3, 2)
zplot_bin(pdf2_samples, label='Multinomial', color='red')
plt.subplot(1, 3, 3)
zplot_bin(pdf3_samples, label='PDF Draws', color='darkviolet')
In [14]:
# plotting binned covariance
plt.figure(figsize=(30, 9))
plt.subplot(1, 3, 1)
cov_draws(pdf1_samples,
xlim=(2.8, 3.6), ylim=(2.8, 3.3),
color='blue', label='Poisson')
plt.subplot(1, 3, 2)
cov_draws(pdf2_samples,
xlim=(2.8, 3.6), ylim=(2.8, 3.3),
color='red', label='Multinomial')
plt.subplot(1, 3, 3)
cov_draws(pdf3_samples,
xlim=(2.8, 3.6), ylim=(2.8, 3.3),
color='darkviolet', label='PDF Draws')
As derived above and shown previously, the "right" thing to do is to sample from the posterior distribution:
$$ \ln P(\boldsymbol{\rho}|\lbrace \mathbf{p}_g \rbrace) = \ln P(\lbrace \mathbf{p}_g \rbrace | \boldsymbol{\rho}) + \ln P(\boldsymbol{\rho}) - \ln P(\lbrace \mathbf{p}_g \rbrace) = \sum_g \ln\left( \mathbf{p}_g \cdot \boldsymbol{\rho} \right) + \ln P(\boldsymbol{\rho}) - \ln P(\lbrace \mathbf{p}_g \rbrace) $$where $P(\lbrace \mathbf{p}_g \rbrace)$ is a constant that can be ignored and $\cdot$ is the dot product.
As before, we will take $P(\boldsymbol{\rho})$ to be a Dirichlet distribution:
$$ \boldsymbol{\rho} \sim {\rm Dir}\left(\mathbf{m} + \boldsymbol{\alpha}\right) $$where $\boldsymbol{\alpha} = \mathbf{1}$ is a set of concentration parameters (with $\mathbf{1}$ corresponding to a uniform prior) and $\mathbf{m}$ is a set of counts we've previously observed. We will prove this result later.
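Putting the likelihood and prior together, evaluating this log-posterior for a candidate $\boldsymbol{\rho}$ is straightforward. A minimal sketch is shown below (logpost_pop is a hypothetical helper for illustration only, separate from whatever frankenz does internally).
In [ ]:
# sketch of the (unnormalized) log-posterior ln P(rho | {p_g})
def logpost_pop(rho, pdfs, alpha, counts_ref):
    """Log-likelihood sum_g ln(p_g . rho) plus a Dirichlet(m + alpha) log-prior."""
    if np.any(rho <= 0.):
        return -np.inf
    rho = rho / rho.sum()  # normalize to a probability vector
    loglike = np.sum(np.log(np.dot(pdfs, rho)))
    logprior = scipy.stats.dirichlet.logpdf(rho, alpha + counts_ref)
    return loglike + logprior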
In [15]:
# grab representative set of previous redshifts
Nref = 1000
redshifts_ref = survey.data['redshifts'][-Nref:]
alpha = np.ones(Nbins)
counts_ref, _ = np.histogram(redshifts_ref, bins=zbins)
# plotting histogrammed representation
plt.figure(figsize=(14, 6))
plt.hist(redshifts, bins=zbins, histtype='stepfilled', lw=5,
color='blue', alpha=0.7, density=True, edgecolor='black',
label='Underlying')
plt.hist(redshifts_ref, bins=zbins, histtype='step', lw=5,
color='red', alpha=0.7, density=True, label='Reference')
plt.xlabel('Redshift')
plt.xlim([zbins[0], zbins[-1]])
plt.yticks([])
plt.ylabel(r'$P(z|\mathbf{g})$')
plt.legend(fontsize='small')
plt.tight_layout()
In [16]:
from frankenz import samplers
# define our prior
def logprior(x, alpha=None, counts_ref=None):
if alpha is None:
alpha = np.ones_like(x)
if counts_ref is None:
counts_ref = np.zeros_like(x)
if np.any(x < 0.):
return -np.inf
return scipy.stats.dirichlet.logpdf(x, alpha + counts_ref)
# initialize sampler
sampler_pop = samplers.population_sampler(zpdf_norm)
# run MH-in-Gibbs MCMC
Nburn = 100
sampler_pop.run_mcmc(Nsamples + Nburn, logprior_nz=logprior, prior_args=[alpha, counts_ref])
In [17]:
# grab samples
pdf4_samples, pdf4_lnps = sampler_pop.results
pdf4_samples = pdf4_samples[-Nsamples:] * Nobs  # discard burn-in and rescale to counts
In [18]:
# plotting
plt.figure(figsize=(30, 12))
plt.subplot(2, 3, 1)
zplot_bin(pdf1_samples, label='Poisson', color='blue')
plt.subplot(2, 3, 2)
zplot_bin(pdf2_samples, label='Multinomial', color='red')
plt.subplot(2, 3, 3)
zplot_bin(pdf3_samples, label='PDF Draws', color='darkviolet')
plt.subplot(2, 3, 4)
zplot_bin(pdf4_samples, label='Population', color='darkgoldenrod')
In [20]:
# plotting binned covariance
plt.figure(figsize=(30, 18))
plt.subplot(2, 3, 1)
cov_draws(pdf1_samples,
xlim=(2.8, 4.1), ylim=(2.8, 3.7),
color='blue', label='Poisson')
plt.subplot(2, 3, 2)
cov_draws(pdf2_samples,
xlim=(2.8, 4.1), ylim=(2.8, 3.7),
color='red', label='Multinomial')
plt.subplot(2, 3, 3)
cov_draws(pdf3_samples,
xlim=(2.8, 4.1), ylim=(2.8, 3.7),
color='darkviolet', label='PDF Draws')
plt.subplot(2, 3, 4)
cov_draws(pdf4_samples,
xlim=(2.8, 4.1), ylim=(2.8, 3.7),
color='darkgoldenrod', label='Posterior')
One of the issues with the approach above is that we're trying to estimate the distribution of the population $\boldsymbol{\rho}$ (i.e. the "prior") while ignoring the impact on the individual PDFs $\mathbf{p}_g$. Conceptually, the two should be linked, since our estimate of the population should better inform our guesses for each individual object. More formally, this means we want to estimate
$$ P(\boldsymbol{\rho}, \lbrace z_g \rbrace |\lbrace \mathbf{L}_g \rbrace) $$where $\mathbf{L}_g \equiv P(g|z) \equiv \mathcal{L}(g|z)$ is now the likelihood rather than the posterior. It is extremely important that we are now conditioning on the likelihoods: the posterior of each object already applies some (implicit) prior, and we are now interested in modeling that explicitly.
We can draw samples from this distribution using Gibbs sampling by iteratively sampling from
$$ {\rm Object\:step\!:} \quad P(\lbrace z_g' \rbrace |\lbrace \mathbf{L}_g \rbrace, \boldsymbol{\rho}') = \prod_g P(z_g' |\mathbf{p}_g') $$and
$$ {\rm Population\:step\!:} \quad P(\boldsymbol{\rho}' |\lbrace z_g' \rbrace, \lbrace \mathbf{L}_g \rbrace) = P(\boldsymbol{\rho}' |\lbrace z_g' \rbrace) = P(\boldsymbol{\rho}' | \mathbf{n}') \propto P(\mathbf{n}' | \boldsymbol{\rho}') P(\boldsymbol{\rho}') $$in turn.
The first term is straightforward: assuming independence, we are just trying to simulate the redshift of each galaxy $g$ given its PDF
$$ \mathbf{p}_g' = \frac{\mathbf{L}_g \odot \boldsymbol{\rho}'}{\mathbf{L}_g \cdot \boldsymbol{\rho}'} $$where $\odot$ is element-wise multiplication (i.e. the Hadamard product), $\cdot$ is the dot product, and $\boldsymbol{\rho}'$ is our current best guess of the prior. Since we only know whether a redshift is in a given bin, this becomes a draw from the Categorical distribution:
$$ z_g' \sim \textrm{Cat}\left(\mathbf{p}_g\right) $$The second term has two components. The first is the likelihood $P(\mathbf{n}'|\boldsymbol{\rho}')$, which states how likely we are to observe the set of redshifts $\lbrace z_g'\rbrace$ given our assumed population prior $\boldsymbol{\rho}'$. Using our result above, we've exploited the fact that
$$ z_g' \sim \textrm{Cat}\left(\mathbf{p}_g\right) \quad \Leftrightarrow \quad \mathbf{n}_g' \sim \textrm{Mult}\left(1, \mathbf{p}_g\right) $$to write the collection of all the redshifts $\lbrace z_g' \rbrace$ as
$$ \mathbf{n}' = \sum_{g} \mathbf{n}_g' $$which is just the number of counts in each bin. Since, under the population model, each redshift is an independent draw from $\boldsymbol{\rho}'$, the total counts are multinomially distributed:
$$ \mathbf{n}' \sim {\rm Mult}\left(N_\mathbf{g}, \boldsymbol{\rho}'\right) $$Unfortunately, we are interested in $\boldsymbol{\rho}'$, not $\mathbf{n}'$. So for a given prior $P(\boldsymbol{\rho}')$, we'll actually have to sample $\boldsymbol{\rho}'$ from its posterior distribution, analogous to our population inference case.
To avoid having to sample from $P(\boldsymbol{\rho}' | \mathbf{n}')$ directly, we can exploit conjugate priors to get the posterior in an analytic form. For this multinomial, the conjugate prior is the Dirichlet distribution. Taking the prior to be
$$ \boldsymbol{\rho}' \sim {\rm Dir}\left(\boldsymbol{\alpha}\right) $$with concentration parameters $\boldsymbol{\alpha}$, for a given $\mathbf{n}'$ the posterior for $\boldsymbol{\rho}'$ becomes
$$ \boldsymbol{\rho}' \sim {\rm Dir}\left(\mathbf{n}' + \boldsymbol{\alpha}\right) $$Assuming we have a reference set of $\mathbf{m}$ counts we've observed in a previous sample, we can apply the same logic to arrive at a modified prior
$$ \boldsymbol{\rho} \sim {\rm Dir}\left(\mathbf{m} + \boldsymbol{\alpha}\right) $$This is straightforward to simulate from.
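A single Gibbs iteration over $(\boldsymbol{\rho}', \lbrace z_g' \rbrace)$ can therefore be sketched in a few lines. The helper below (gibbs_step) is purely illustrative and is not the frankenz implementation; it assumes binned, normalized likelihoods like zpdf_norm above.
In [ ]:
# sketch of one Gibbs iteration using the conjugate updates derived above
def gibbs_step(likes, rho, counts_ref, alpha):
    """One object step + population step.
    likes: (Nobs, Nbins) binned likelihoods; rho: current population draw;
    counts_ref: reference counts m; alpha: Dirichlet concentrations."""
    # object step: p_g' = L_g * rho' / (L_g . rho'), then z_g' ~ Cat(p_g')
    pdfs = likes * rho
    pdfs /= pdfs.sum(axis=1)[:, None]
    counts = np.zeros(len(rho), dtype=int)
    for p in pdfs:
        counts += np.random.multinomial(1, p)  # categorical draw per object
    # population step: rho' ~ Dir(n' + m + alpha) from conjugacy
    rho_new = np.random.dirichlet(counts + counts_ref + alpha)
    return rho_new, counts

# e.g.: rho0 = np.random.dirichlet(alpha + counts_ref)
#       rho1, n1 = gibbs_step(zpdf_norm, rho0, counts_ref, alpha)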
As with our population model, frankenz also contains an "out-of-the-box" MCMC sampler to sample from our hierarchical model. You can pass your own set of alpha values if you like; otherwise alpha=1 is assumed.
In [21]:
# initialize sampler
sampler_hb = samplers.hierarchical_sampler(zpdf_norm)
# run HB Gibbs sampler
sampler_hb.run_mcmc(Nsamples + Nburn, alpha=alpha+counts_ref)
Because we can generate samples from both of our conditional distributions analytically, the hierarchical model -- in addition to being the more "natural" model -- is actually easier (and quicker) to sample from than the population model.
In [22]:
# grab samples
pdf5_samples, pdf5_lnps = sampler_hb.results
pdf5_samples = pdf5_samples[-Nsamples:] * Nobs  # discard burn-in and rescale to counts
In [23]:
# plotting
plt.figure(figsize=(30, 12))
plt.subplot(2, 3, 1)
zplot_bin(pdf1_samples, label='Poisson', color='blue')
plt.subplot(2, 3, 2)
zplot_bin(pdf2_samples, label='Multinomial', color='red')
plt.subplot(2, 3, 3)
zplot_bin(pdf3_samples, label='PDF Draws', color='darkviolet')
plt.subplot(2, 3, 4)
zplot_bin(pdf4_samples, label='Population', color='darkgoldenrod')
plt.subplot(2, 3, 5)
zplot_bin(pdf5_samples, label='Hierarchical (1-level)', color='fuchsia')
In [24]:
# plotting binned covariance
plt.figure(figsize=(30, 18))
plt.subplot(2, 3, 1)
cov_draws(pdf1_samples,
xlim=(2.8, 4.0), ylim=(2.8, 3.8),
color='blue', label='Poisson')
plt.subplot(2, 3, 2)
cov_draws(pdf2_samples,
xlim=(2.8, 4.0), ylim=(2.8, 3.8),
color='red', label='Multinomial')
plt.subplot(2, 3, 3)
cov_draws(pdf3_samples,
xlim=(2.8, 4.0), ylim=(2.8, 3.8),
color='darkviolet', label='PDF Draws')
plt.subplot(2, 3, 4)
cov_draws(pdf4_samples,
xlim=(2.8, 4.0), ylim=(2.8, 3.8),
color='darkgoldenrod', label='Posterior')
plt.subplot(2, 3, 5)
cov_draws(pdf5_samples,
xlim=(2.8, 4.0), ylim=(2.8, 3.8),
color='fuchsia', label='Hierarchical\n(1-level)')
While the above results have looked at the marginalized distribution of $\boldsymbol{\rho}$, we can use our results to generate the marginalized distributions for each of our objects $P(z|g)$. Because these apply the prior learned from the population, the individual PDFs are often better-constrained as a result. This effect is known as hierarchical shrinkage.
In [25]:
# plot new PDFs: reweight each object's PDF by every posterior draw of rho'
# (p_g' = L_g * rho' / (L_g . rho')) and accumulate multinomial draws to show the shrinkage
plt.figure(figsize=(20, 12))
for i, (j, c) in enumerate(zip(idxs, colors)):
plt.subplot(Nfigs[0], Nfigs[1], i + 1)
n1, _, _ = plt.hist(zbins_mid, zbins, weights=zpdf_bins[j],
color=c, lw=4, alpha=0.3, histtype='step',
density=True)
zpdf_bins_t = np.sum([np.random.multinomial(10, (zpdf_bins[j] * p /
np.dot(zpdf_bins[j], p)))
for p in pdf5_samples], axis=0)
n2, _, _ = plt.hist(zbins_mid, zbins, weights=zpdf_bins_t,
color=c, lw=4, alpha=0.8, histtype='step',
density=True)
plt.vlines(redshifts[j], 0., np.max([n1, n2]) * 1.2, color='red',
lw=3)
plt.xlim([-0.5, 6])
plt.ylim([0.03, None])
plt.xlabel('Redshift')
plt.yticks([])
plt.ylabel('PDF')
plt.tight_layout()
One issue the above models have is that they assume that we have a "representative" reference sample. What happens in the case where we think our reference sample might be unrepresentative? Well, then we can pretend we have a "virtual" reference set with counts $\mathbf{k}'$ and add in a third step to our model:
$$ {\rm Reference\:step\!:} \quad P(\mathbf{k}' |\boldsymbol{\rho}', \lbrace z_g' \rbrace, \lbrace \mathbf{L}_g \rbrace) = P(\mathbf{k}' | \boldsymbol{\rho}') $$Sampling from this turns out to be relatively straightforward:
$$ \mathbf{k}' \sim {\rm Mult}\left(N_{\rm ref}, \frac{\mathbf{m} + \boldsymbol{\beta} + N_\mathbf{g} \boldsymbol{\rho}'}{N_{\rm ref} + N_{\boldsymbol{\beta}} + N_\mathbf{g}}\right) $$where $\boldsymbol{\beta}$ is a set of concentration parameters from an additional Dirichlet hyper-prior (taken to be $\mathbf{1}$ by default) and $N_{\boldsymbol{\beta}} = \sum_b \beta_b$.
We then use these "virtual" reference counts $\mathbf{k}'$ (in place of the fixed counts $\mathbf{m}$) in the remainder of the sampling steps.
This represents a compromise between the reference sample we observed and the distribution constrained by our data. In the case where $N_{\rm ref} \gg N_\mathbf{g}$, $\mathbf{k}'$ essentially resamples the reference counts and only adds multinomial sampling noise. In the case where $N_\mathbf{g} \gg N_{\rm ref}$, this reduces to randomly sampling from our estimated $\boldsymbol{\rho}'$ and will have minimal impact on the analysis.
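The reference step itself is easy to simulate; here is a minimal sketch (sample_virtual_ref is a hypothetical helper for illustration only):
In [ ]:
# sketch of the reference step: draw "virtual" reference counts k'
def sample_virtual_ref(counts_ref, beta, rho, Ng):
    """Draw k' ~ Mult(N_ref, (m + beta + Ng * rho') / (N_ref + N_beta + Ng))."""
    Nref = counts_ref.sum()
    probs = (counts_ref + beta + Ng * rho) / (Nref + beta.sum() + Ng)
    return np.random.multinomial(Nref, probs)

# e.g.: k = sample_virtual_ref(counts_ref, np.ones(Nbins),
#                              np.random.dirichlet(alpha + counts_ref), Nobs)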
We can sample with this additional step by passing the reference counts to ref_sample when sampling.
In [26]:
beta = np.ones(Nbins)
# run HB Gibbs sampler w/ explicit reference set
sampler_hb2 = samplers.hierarchical_sampler(zpdf_norm)
sampler_hb2.run_mcmc(Nsamples + Nburn, alpha=alpha, ref_sample=counts_ref, beta=beta)
In [27]:
# grab samples
pdf6_samples, pdf6_lnps = sampler_hb2.results
pdf6_samples = pdf6_samples[-Nsamples:] * Nobs  # discard burn-in and rescale to counts
In [28]:
# plotting
plt.figure(figsize=(30, 12))
plt.subplot(2, 3, 1)
zplot_bin(pdf1_samples, label='Poisson', color='blue')
plt.subplot(2, 3, 2)
zplot_bin(pdf2_samples, label='Multinomial', color='red')
plt.subplot(2, 3, 3)
zplot_bin(pdf3_samples, label='PDF Draws', color='darkviolet')
plt.subplot(2, 3, 4)
zplot_bin(pdf4_samples, label='Population', color='darkgoldenrod')
plt.subplot(2, 3, 5)
zplot_bin(pdf5_samples, label='Hierarchical, fixed ref', color='fuchsia')
plt.subplot(2, 3, 6)
zplot_bin(pdf6_samples, label='Hierarchical, flex ref', color='seagreen')
In [32]:
# plotting binned covariance
plt.figure(figsize=(30, 18))
plt.subplot(2, 3, 1)
cov_draws(pdf1_samples,
xlim=(2.8, 4.2), ylim=(2.5, 3.9),
color='blue', label='Poisson')
plt.subplot(2, 3, 2)
cov_draws(pdf2_samples,
xlim=(2.8, 4.2), ylim=(2.5, 3.9),
color='red', label='Multinomial')
plt.subplot(2, 3, 3)
cov_draws(pdf3_samples,
xlim=(2.8, 4.2), ylim=(2.5, 3.9),
color='darkviolet', label='PDF Draws')
plt.subplot(2, 3, 4)
cov_draws(pdf4_samples,
xlim=(2.8, 4.2), ylim=(2.5, 3.9),
color='darkgoldenrod', label='Posterior')
plt.subplot(2, 3, 5)
cov_draws(pdf5_samples,
xlim=(2.8, 4.2), ylim=(2.5, 3.9),
color='fuchsia', label='Hierarchical\n(fixed ref)')
plt.subplot(2, 3, 6)
cov_draws(pdf6_samples,
xlim=(2.8, 4.2), ylim=(2.5, 3.9),
color='seagreen', label='Hierarchical\n(flex ref)')
In [33]:
# plot new PDFs using the flexible-reference posterior draws (same reweighting as above)
plt.figure(figsize=(20, 12))
for i, (j, c) in enumerate(zip(idxs, colors)):
plt.subplot(Nfigs[0], Nfigs[1], i + 1)
n1, _, _ = plt.hist(zbins_mid, zbins, weights=zpdf_bins[j],
color=c, lw=4, alpha=0.3, histtype='step',
density=True)
zpdf_bins_t = np.sum([np.random.multinomial(10, (zpdf_bins[j] * p /
np.dot(zpdf_bins[j], p)))
for p in pdf6_samples], axis=0)
n2, _, _ = plt.hist(zbins_mid, zbins, weights=zpdf_bins_t,
color=c, lw=4, alpha=0.8, histtype='step',
density=True)
plt.vlines(redshifts[j], 0., np.max([n1, n2]) * 1.2, color='red',
lw=3)
plt.xlim([-0.5, 6])
plt.ylim([0.03, None])
plt.xlabel('Redshift')
plt.yticks([])
plt.ylabel('PDF')
plt.tight_layout()
As we expect, this higher-level model, where we allow the reference set to vary based on our estimates of the population, increases the uncertainty in our inference.